PySCNet: A tool for reconstructing and analyzing gene regulatory network from single-cell RNA-Seq data

PySCNet includes four modules:

1) Pro-precessing: initialize a gnetData object consisting of Expression Matrix, Cell Attributes, Gene Attributes and Network Attributes;
2) BuildNet: reconstruct GRNs by various methods implemented in docker;
3) NetEnrich: network analysis including consensus network detection, gene module identification and trigger path prediction as well as network fusion;
4) Visulization: network illustration.

A python package - STREAM was designed for reconstructing cell trajectory for single cell transcriptomic data. This tutorial guides how to integrate STREAM with pyscnet for gene regulatory network along the cell differential trajectory.

In [1]:
from __future__ import absolute_import
import warnings
warnings.filterwarnings("ignore")

import sys
import os
import itertools
import scanpy as sc

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, facecolor='white')

from pyvis.network import Network
import pandas as pd
import copy
import numpy as np
import matplotlib.pyplot as plt 
from pyscnet.Preprocessing import gnetdata
from pyscnet.BuildNet import gne_dockercaller as gdocker
from pyscnet.BuildNet import gne_synchrony as gs
from pyscnet.NetEnrich import graph_toolkit as gt

pd.set_option('display.max_rows', 1000)
scanpy==1.5.1 anndata==0.7.4 umap==0.4.6 numpy==1.17.5 scipy==1.5.2 pandas==1.0.5 scikit-learn==0.22.2.post1 statsmodels==0.11.1

Data resource

Data was obtained from Nestorowa, S. et al. The cell trajectory was build according to STREAM Tutorial. As STREAM is also built on AnnData structure, it can be directly imported into pyscnet.

In [2]:
#the file is too large
# import _pickle as pk
# with open('data/stream_adata.pk', 'rb') as input:
#     adata = pk.load(input)
#user jupyter-notebook in stream env: ~/miniconda3/envs/stream/bin/
import stream as st
adata = st.read('data/stream_adata.pklz')
Working directory is already specified as './result' 
To change working directory, please run set_workdir(adata,workdir=new_directory)

As explained in STREAM Tutorial, new attributes including cell pseudotime at different branches are added into obs of AnnData. It gives us the hints of ordering cells by pseudotime and estimating gene dynamics along cell trajectory.

In [3]:
adata.var_names
Out[3]:
Index(['Gm28070', 'Gm16322', 'Gm8495', 'Igkv3-11', 'Gm22363', 'Gcfc2',
       'Gm24592', 'Rho', 'Gm25111', 'Slco2a1',
       ...
       'Impact', 'Dsg3', 'Fam13b', 'Rprd1a', 'Taf4b', 'Zfp521', 'Hrh4', 'Dsc1',
       'Psma8', '2700062C07Rik'],
      dtype='object', length=35077)

Cell trajectories built by stream

As shown below, 6 cell sub-populations form a Bifurcation trajectory starting from HSC expanding into two directions. One is MEP branch and the other one is GMP branch. Additionally, leaf markers and transition markers are detected for each branch. It guides us for feature selection

In [4]:
st.plot_stream_sc(adata,root='S3',color=['label'],
                  dist_scale=0.1,show_graph=True,show_text=True)
#                   save_fig=True, fig_path='result/cell_branches')
In [5]:
#leaf_markers of cell pathway: S1 -> S2
adata.uns['leaf_markers'][('S1','S2')].head()
Out[5]:
zscore H_statistic H_pvalue S3S1_pvalue S1S2_pvalue S1S0_pvalue
Mfsd2b 1.41257 938.321 1.76265e-204 2.78863e-209 1 5.62722e-271
Lcp1 -1.41168 838.086 1.02799e-182 1.04374e-144 1 7.64013e-242
Aqp1 1.41421 830.173 5.37295e-181 7.15629e-161 1 5.32332e-230
Gata1 1.41361 829.465 7.65578e-181 1.96117e-162 1 9.52239e-229
Klf1 1.41378 801.407 9.47857e-175 2.91898e-158 1 5.26742e-215
In [7]:
st.plot_stream(adata,root='S3',color=['Gata1','Klf1'])
In [8]:
#Import stream anndata into pyscnet
stream_gne = gnetdata.load_from_scanpy(adata)

Select trajectory branch and Features for GRN:

In the following, cells from S1 to S2 and positive leaf markers of (S1->S2) will be selected for GRN construction. As shown below, positive leaf markers are only highly expressed in cells of S1 -> S2. There are 236 genes and 535 cells selected for downstream analysis.

In [9]:
#select cells from S1 -> S2
cell_info = stream_gne.CellAttrs['CellInfo'].sort_values('S1_pseudotime', ascending=True)
cell = list(cell_info.loc[cell_info['branch_id_alias'].isin([('S2', 'S1')])].index)
len(cell)
Out[9]:
535
In [10]:
#select positive leaf markers 
feature = adata.uns['leaf_markers'][('S1', 'S2')]
# feature = feature.sort_values('zscore', ascending=False).head(100).index
feature = feature[feature.zscore > 0].index

# with open('result/S0_S1_leafMarker.txt', 'w') as file:
#     file.writelines(["%s\n" % item for item in feature])

Remove isolated nodes:

In order to evaluate the performance of different GRN methods, reference linkage was download from string PPI filtered by high confidence score > 0.7 and PPI enrichment p-value < 1.0e-16. There are 420 edges left among the 117 features. 119 isolated ndoes without connecting to any others were removed for GRN construction.

In [11]:
import networkx as nx
Ref_edges = pd.read_csv('/home/mwu/Downloads/string_interactions_2.tsv', 
                          sep='\t').iloc[:,:2]

Ref_edges.columns = ['source', 'target']
Ref_edges['weight'] = 1

G = nx.from_pandas_edgelist(Ref_edges)
feature = list(set(feature) & set(G.nodes))

len(feature)
Out[11]:
117
In [12]:
# Set window size to compute moving window synchrony.
r_window_size = 50
subExpr = stream_gne.ExpMatrix.loc[feature, cell]
df = subExpr.T
genes = ['Mki67', 'Top2a']
# Interpolate missing data.
df_interpolated = df.interpolate()
# Compute rolling window synchrony
rolling_r = df_interpolated[genes[0]].rolling(window=r_window_size, center=True).corr(df_interpolated[genes[1]])
f,ax=plt.subplots(2,1,figsize=(14,6),sharex=True)
df[genes].rolling(window=r_window_size,center=True).mean().plot(ax=ax[0])
ax[0].set(xlabel='Frame',ylabel='Expression level')
rolling_r.plot(ax=ax[1])
ax[1].set(xlabel='Frame',ylabel='Pearson r')
plt.suptitle("pairwise gene and rolling window correlation")
Out[12]:
Text(0.5, 0.98, 'pairwise gene and rolling window correlation')

As reported in GRN benchmark paper, although some tools were designed for estimating gene regulatory relationship from transcriptomic data, only three methods (GENIE3, PIDC and GRNBOOST2) were considered as competitive. Besides, two additional methods (window_rolling and phase synchrony) are also provided in the pyscnet.

In [ ]:
stream_gne = gs.get_synchrony(stream_gne.deepcopy, method='window_rolling',
                               feature=feature, cell=cell)

stream_gne = gs.get_synchrony(stream_gne.deepcopy, method='phase_synchrony',
                               feature=feature, cell=cell)
In [ ]:
#GRNs for MEP branch
stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='GENIE3', directed=False,
                               feature=feature, cell=cell)

stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='PIDC', directed=False,
                               feature=feature, cell=cell)

stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='CORR', directed=False,
                               feature=feature, cell=cell)

stream_gne = gdocker.rundocker(stream_gne.deepcopy, method='GRNBOOST2', directed=False,
                               feature=feature, cell=cell)

Assess confidence of different GRNs

To quality the confidence of GRNs obtained from differen methods, top 420 edges were selected from each methods. ROC and PR curves were ultilized for methods evaluation.

In [14]:
Ref_edges = pd.read_csv('/home/mwu/Downloads/string_interactions_2.tsv', 
                          sep='\t').iloc[:,:2]

Ref_edges.columns = ['source', 'target']
Ref_edges['weight'] = 1
Ref_edges.shape

#remove self-loop edges
index_list=list()
for i in range(Ref_edges.shape[0]):
    if Ref_edges.iloc[i]['source'] == Ref_edges.iloc[i]['target']:
        index_list.append(i)
Ref_edges=Ref_edges.drop(index_list)

#As edges are undirected, the following is to create Full reference links (117*116/2) and then merge with the string links 
from itertools import product, permutations, combinations, combinations_with_replacement
Full_Ref = pd.DataFrame(combinations(set(feature), 2), columns=['source', 'target'])
Full_Ref['weight'] = 0

Ref_edges['name'] = Ref_edges['source'] + '_' + Ref_edges['target']
Full_Ref['name'] = Full_Ref['source']+ '_' + Full_Ref['target']

i = 0
for name in Ref_edges.name:
    tmp = list(name.split('_'))
    
    if name in list(Full_Ref['name']):
        
        
        Full_Ref.loc[Full_Ref.name ==  name, 'weight'] = 1
    
    elif tmp[1] + '_' + tmp[0] in list(Full_Ref['name']):
        
        
        Full_Ref.loc[Full_Ref.name == tmp[1] + '_' + tmp[0], 'weight'] = 1
    
    else:
        i = i +1
        
Full_Ref[Full_Ref.weight == 1].shape
Out[14]:
(420, 4)

Phase synchrony based method outperforms others

As shown below, phase synchrony method outperforms others in terms of top 420 edges.

In [17]:
import alg_evaluation

dict_of_algs = dict()
link_keys = list(filter(lambda x:'_links' in x, stream_gne.NetAttrs.keys()))

for link in link_keys:
    method = link.split('_links')[0]
    top = stream_gne.NetAttrs[link].sort_values('weight', ascending=False).head(420)
    dict_of_algs[method] = alg_evaluation.evaluation(top, copy.deepcopy(Full_Ref))

    
fig = plt.figure(figsize=(20, 8))
grid = plt.GridSpec(1, 2, wspace=0.2, hspace=0.2)
ax1 = fig.add_subplot(grid[0, 0])
ax2 = fig.add_subplot(grid[0, 1])

filename='test'
alg_evaluation.build_curves(ax1, dict_of_algs, 'ROC', filename, linewidth=3)
alg_evaluation.build_curves(ax2, dict_of_algs, 'PCR', filename, linewidth=3)

fig.savefig('result/test_consensus_confidence.pdf')

NetEnrich: Graph travel and supervised random walk

NetEnrich module integretes graph traversal techniques to expolre gene regulatory network. It includes Breadth-first search, Depth-first search and Supervised random walk. In this way, gene trigger paths indicating genes with hidden associations can be predicted.

In [18]:
#build graph
from pyscnet.NetEnrich import graph_toolkit as gt
stream_gne = gt.buildnet(stream_gne, key_links='phase_synchrony_links', top=420)

#calculate node centrality
stream_gne = gt.get_centrality(stream_gne)

#detect gene community
stream_gne = gt.detect_community(stream_gne)
graph added into NetAttrs
node centralities added into NetAttrs.
gene communities added into NetAttrs

Missing linkages can be detected by supervised random walk

We take node Mki67 as a starting point and perform a 5 steps random walk guided by node centrality (degree). It generates a random path (Mki67-Rrm1-Rrm2-Mcm5-Mcm7-Pa2g4). Superisely, three co-regulation edges (Mki67-Rrm2, Mki67-Mcm5, Mki67-Mcm7) missed in prediction were detected via random walk.

In [19]:
# Breadth-first search to explore gene network
# bfs_path = list(gt.graph_traveral(stream_gne.NetAttrs['graph'], start='Mki67', threshold=1, method='dfs'))
# bfs_path

# #Supervised random walk
random_path = gt.random_walk(stream_gne, start = 'Mki67', supervisedby='degree', steps=5)
random_path
Out[19]:
['Mki67', 'Rrm1', 'Rrm2', 'Mcm5', 'Mcm7', 'Pa2g4']
In [20]:
import graph_tool.all as gta
import matplotlib.pyplot as plt
from pyscnet.Plotting import _nx2gt as nx2gt

#select top 420 edges
tmp = stream_gne.NetAttrs['phase_synchrony_links'].sort_values('weight', ascending=False).head(420)
net1 = Ref_edges.loc[(Ref_edges.target=='Mki67') | (Ref_edges.source == 'Mki67')]
net2 = tmp.loc[(tmp.source=='Mki67') | (tmp.target=='Mki67')]
net3 = tmp.loc[(tmp.source.isin(['Mki67'])) | (tmp.target.isin(['Mki67', 'Rrm1', 'Rrm2', 'Mcm5', 'Mcm7', 'Pa2g4']))]


plt.switch_backend("cairo")

fig, ax = plt.subplots(1, 2, figsize=(20, 8))

G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net1, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)

edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))

gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
               vertex_size=0.8,  vertex_font_size=0.2, mplfig=ax[0])

ax[0].set_title('Edges connected to Mki67 in StringPPI')

G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net2, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)

edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))

gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
               vertex_size=0.8,  vertex_font_size=0.2, mplfig=ax[1])

ax[1].set_title('Edges connected to Mki67 in prediction')
Out[20]:
Text(0.5, 1.0, 'Edges connected to Mki67 in prediction')
In [21]:
fig, ax = plt.subplots(1, figsize=(20, 8))
G1 = nx2gt.nx2gt(nx.from_pandas_edgelist(net3, edge_attr='weight'))
pos = gta.fruchterman_reingold_layout(G1)

edge_color = G1.new_edge_property("string")
labels = dict(zip(list(range(G1.num_vertices())), list(G1.vertex_properties['id'])))
highlight_path=random_path
for e in G1.edges():
    if (highlight_path is not None) and (labels[list(e)[0]] in highlight_path) and (
            labels[list(e)[1]] in highlight_path) and abs(highlight_path.index(labels[list(e)[0]]) - highlight_path.index(labels[list(e)[1]])) == 1:
        edge_color[e] = 'darkorange'
    else:
        edge_color[e] = 'lightgrey'

pos = gta.arf_layout(G1)
gta.graph_draw(G1, vertex_text=G1.vertex_properties['id'], vertex_text_position=-2,
               edge_color=edge_color, vertex_size=1.2,  vertex_font_size=0.3, mplfig=ax)


ax.set_title('5 steps random walk starting from Mki67')
Out[21]:
Text(0.5, 1.0, '5 steps random walk starting from Mki67')

Plotting: Network Visualization

Plotting module principlely provides a set of functions for visualization.

In [22]:
from pyscnet.Plotting import net_plot as npl

plt.switch_backend("cairo")
fig, ax = plt.subplots(1, 2, figsize=(40, 18))
# npl.net_matrix_plot(stream_gne, output_size=(500, 500), vertex_text_position=-2, vertex_font_size=5)
npl.net_hierarchy_plot(stream_gne, vertex_text_position=-2, vertex_size=0.2,  vertex_font_size=0.05, mplfig=ax[0])
npl.net_matrix_plot(stream_gne, vertex_text_position=-2, vertex_size=0.8,  vertex_font_size=0.2, mplfig=ax[1])
In [24]:
from pyscnet.Plotting import __geneDance_plot as gp
cg = gp.geneHeatmap(stream_gne, gene=['Mki67', 'Top2a', 'Ccnb2'], figsize=[15,1.5], 
                    cell_clusterid=('S2', 'S1'), select_by='branch_id_alias', yticklabels=False,
                    order_by='S1_pseudotime', col_cluster=False, cmap='RdBu_r')

cg.ax_row_dendrogram.set_visible(False)
In [25]:
# #gene correlation heatmap

gp.geneCorrelation(stream_gne, gene=feature,
                cell_clusterid=('S2', 'S1'), select_by='branch_id_alias',
                order_by='S1_pseudotime', xticklabels=False, yticklabels=False,
                save_as='result/gene_correlation.pdf', figsize=[5, 5])
Out[25]:
<AxesSubplot:>
In [13]:
#save stream_gne as pickle 
# stream_gne.save_as('data/pyscnet_stream.pk')
stream_gne = gnetdata.load_Gnetdata_object('data/pyscnet_stream.pk')
In [ ]:
import openpyxl
tmp = adata.uns['leaf_markers'][('S1', 'S2')].loc[feature]
tmp.to_excel('result/selected_features.xlsx')